DATA1220-55, Fall 2024
2024-09-18
Instructions (homework2_instructions.pdf), a Quarto markdown template (homework2_template.qmd), and an example HTML output (homework2_example.html) are available for download under Chapter 2 on the Modules page in Canvas.
Upload TWO (2) documents to Homework 2 on the Assignments page in Canvas by 6:00pm on Friday 9/20/2024: homework2_yourlastname.qmd and homework2_yourlastname.html
A video walk-through of Homework 2 is available under Tutorials on the Modules page in Canvas. Make sure you’re caught up on the video walk-through of Homework 1.
“This homework is due by 6:00pm on Friday, 9/20/24. No credit will be lost for assignments received by 7:00pm to account for issues with uploading. 10% of the points will be deducted from assignments received by 9:00am on Saturday, 9/21/24. Assignments turned in after this point are only eligible for 50% credit, so it benefits you to turn in whatever you have completed by the due date.”
Read the textbook. Many of you are asking for additional examples. Luckily, the textbook has tons that we didn’t go over in class.
Ask a question on our Campuswire class feed. I’m only one person, and I may not be able to give you a prompt answer. However, the 20+ other people in the class might be able to.
Come to office hours. I will be available after class today, Wednesday 9/25/2024, from 2:30pm to 4:00pm. If you cannot make it, reach out to me to schedule an appointment.
Define probability, random processes, and the law of large numbers
Describe the sample space for disjoint and non-disjoint outcomes
Calculate probabilities using the General Addition and Multiplication Rules
Create a probability distribution for disjoint outcomes
What does the word probability mean to you?
“Highly likely”
“Probably”
“About even”
“Almost no chance”
Did your estimate fall within these ranges? Are these ranges reasonable?
Frequentist Definition
The proportion of times that a particular outcome would occur if we observed a random process an infinite number of times.
A random process is one where you know which outcomes are possible (i.e. the sample space) but you don’t know which outcome comes next
Examples of a random process: coin toss, die roll, stock market
Both Apple and Spotify took steps to make their “shuffle” features less random after complaints from users.
January 11, 2005 – Apple releases the iPod Shuffle, a small device capable only of playing music randomly (“true” shuffle)
July 2011 – Spotify launches in the United States using the Fisher-Yates Algorithm, which is like picking tickets out of a hat until no more remain
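The "tickets out of a hat" idea can be sketched in a few lines of Python. This is only an illustration of the Fisher-Yates algorithm in general, not Spotify's actual code; the function name `fisher_yates_shuffle`, the song titles, and the seed are all made up for the example.

```python
import random


def fisher_yates_shuffle(items, rng=None):
    """Return a shuffled copy of items using the Fisher-Yates algorithm."""
    if rng is None:
        rng = random.Random()
    shuffled = list(items)
    # Walk from the last position down to the second. At each step, swap the
    # current position with a randomly chosen position at or before it.
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # randint includes both endpoints
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


# Hypothetical five-song playlist, used only for illustration.
playlist = ["Song A", "Song B", "Song C", "Song D", "Song E"]
order = fisher_yates_shuffle(playlist, random.Random(1220))
print(order)
```

Every song appears exactly once in the shuffled order, just as every ticket leaves the hat exactly once.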
The human brain is good at finding patterns in noise, even when there are none
If an artist is repeated “too soon”, the listener doesn’t feel the order is random
We perceive a “random” distribution as also being “uniform” and “fair”
Songs not evenly distributed across albums and artists on a playlist
Some albums/artists may play more frequently than others simply because they have more songs in the library/on the playlist
Each song is equally likely to play next (uniform), but not each artist (not uniform)
Artists/albums with more songs also more likely to play in a row
A true random shuffle might play the same artist multiple times in a row
It’s unusual but not impossible to roll a 1 on a die 3 times in a row
It’s also possible for the same song to play twice in a row
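To see how "unusual but not impossible" these repeats are, here is a small worked calculation. It assumes a fair six-sided die and a 50-song playlist on "true" shuffle (the playlist size is an assumption for this sketch), and it assumes each draw is independent of the one before it, so the probabilities can be multiplied.

```python
from fractions import Fraction

# Rolling a 1 on a fair six-sided die three times in a row.
p_one = Fraction(1, 6)
p_three_ones = p_one ** 3
print(p_three_ones, float(p_three_ones))

# Hearing the same song twice in a row on an assumed 50-song playlist.
# Whatever song just played, it is one of 50 equally likely songs next time.
playlist_size = 50  # assumption for this sketch
p_repeat_song = Fraction(1, playlist_size)
print(p_repeat_song, float(p_repeat_song))
```

Both probabilities are small, but neither is zero, so a truly random shuffle will produce these repeats from time to time.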
Each time the song changes, every song on the playlist is eligible to be played next
Does not matter if the song was just played
Does not matter who the artist is
We call this sampling with replacement.
Like drawing a playing card, looking at it, then putting it back in the deck before the next draw
Repetition of outcomes is possible
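A short sketch of the difference between sampling with and without replacement, using Python's standard `random` module. The five song titles and the seed are hypothetical and only there to make the example repeatable.

```python
import random

rng = random.Random(1220)  # fixed seed so the sketch is repeatable
songs = ["Song A", "Song B", "Song C", "Song D", "Song E"]  # hypothetical titles

# With replacement: every song is eligible on every draw, so repeats can occur.
# Drawing 10 times from only 5 songs guarantees at least one repeat.
with_replacement = rng.choices(songs, k=10)

# Without replacement: once a song is drawn it is set aside, so no repeats.
without_replacement = rng.sample(songs, k=5)

print(with_replacement)
print(without_replacement)
```

`random.choices` behaves like putting the card back in the deck before each draw, while `random.sample` behaves like keeping each drawn card out of the deck.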
There is some “true” real-world probability that the next song is by Chappell Roan
There is our “observed” probability that the next song is by Chappell Roan
The sample space \(s\) or \(S\) is the total collection of possible outcomes or events for a random process.
Die rolls: 1, 2, 3, 4, 5, 6
Coin flips: heads, tails
Stock market: up, down, no change
For this example, the sample space could be all the songs on the playlist (n = 50) or all the artists who perform them (n = 26).
In the sample space \(S\), the complement of event A occurring is event A not occurring. This is written as \(A^C\) or \(A'\).
Outcomes are disjoint or mutually exclusive if they cannot both happen at the same time
Taylor Swift and Adele did not collaborate on any songs on this playlist
The next song played can either be by Taylor Swift OR by Chappell Roan but not by Taylor Swift AND Chappell Roan
The events “The next song is by Taylor Swift” and “The next song is by Chappell Roan” are disjoint/mutually exclusive
Non-disjoint outcomes can occur at the same time.
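One way to check whether two events are disjoint is to write each event as a set of outcomes and see whether the sets overlap. The sketch below uses the die-roll sample space; the events "even", "odd", and "greater than 3" are chosen here only for illustration.

```python
sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a six-sided die

even = {2, 4, 6}
odd = {1, 3, 5}
greater_than_3 = {4, 5, 6}

# Disjoint: no outcome belongs to both events.
print(even.isdisjoint(odd))  # True

# Non-disjoint: at least one outcome belongs to both events.
print(even.isdisjoint(greater_than_3))  # False
print(sorted(even & greater_than_3))  # [4, 6]
```

A roll cannot be both even and odd, but a roll of 4 or 6 is both even and greater than 3.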
The beaverduck from Tenso Graphics
Probabilities are proportions, or the number of observations with a particular value divided by…
the total number of observations in a sample (\(n\)) for the sample proportion (\(\hat{p}_n\))
the total number of outcomes in the sample space (\(s\)) for the population proportion (\(p\))
Proportions range from 0 (no observations/outcomes) to 1 (all observations/outcomes)
May also be expressed as a percentage, ranging from 0% to 100% (multiply proportion by 100)
\[ \text{Probability}(\text{Event A})=\operatorname{P}(A) \]
\[ \begin{aligned} \text{Sample Probability}(A)&=\frac{\operatorname{count}(\text{observation} = A)}{\operatorname{count}(\text{observations in sample})} \\ &=\hat{p}_n \end{aligned} \]
\[ \begin{aligned} \text{Population P}(A)&=\frac{\operatorname{count}(\text{event} = A)}{\operatorname{count}(\text{events in sample space})} \\ &=p \end{aligned} \]
\[ \begin{aligned} \operatorname{P}(S)&=1 \\ &= \operatorname{P}(A)+\operatorname{P}(A^{C}) \\ &= \operatorname{P}(A)+\operatorname{P}(A') \end{aligned} \]
\(p=\text{Population Probability}(\text{Next Song by Chappell Roan})\)
The sample space for the population probability that the next song is by Chappell Roan when there is “true” shuffle or sampling with replacement is all songs on the playlist (\(n=50\)).
Chappell Roan has 7 songs on the playlist, so the event “The next song is by Chappell Roan” occurs 7 times within the sample space.
The population probability \(p\) of the next song being by Chappell Roan is…
\[ \begin{aligned} p&=\operatorname{P}(\text{Next Song by Chappell Roan}) \\ &=\frac{\operatorname{count}(\text{Songs By Chappell Roan})}{\operatorname{count}(\text{Total Possible Songs})} \\ &=\frac{7}{50} \\ &=0.14 \\ &= 14\% \end{aligned} \]
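The same arithmetic can be checked in Python, along with the complement (the next song not being by Chappell Roan). The variable names are made up for this sketch; the counts 7 and 50 come from the playlist example.

```python
from fractions import Fraction

songs_by_chappell_roan = 7
total_songs = 50

# Population probability that the next song is by Chappell Roan.
p = Fraction(songs_by_chappell_roan, total_songs)

# Complement: the next song is NOT by Chappell Roan.
p_complement = 1 - p

print(p, float(p))  # 7/50 0.14
print(p_complement, float(p_complement))  # 43/50 0.86
```

The two probabilities add to 1 because every song on the playlist is either by Chappell Roan or not.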
The sample space for the sample probability of the next song being by Chappell Roan when there is “true” shuffle or sampling with replacement is all songs listened to so far (\(n \geq 1\)).
Each time a Chappell Roan song is played, an event is counted / recorded.
The sample probability \(\hat{p}_n\) of the next song being by Chappell Roan is…
\[ \begin{aligned} \hat{p}_n&=\operatorname{P}(\text{Next Song by Chappell Roan}) \\ &=\frac{\operatorname{count}(\text{Songs Heard By Chappell Roan})}{\operatorname{count}(\text{Total Songs Heard})} \\ &=\frac{x}{n} \end{aligned} \]
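As a quick illustration of \(x/n\), suppose we kept a log of the artists we heard. The five-song log below is made up for this sketch and is not real listening data.

```python
# Hypothetical listening log: the artist of each song heard so far.
heard = [
    "Chappell Roan",
    "Taylor Swift",
    "Adele",
    "Chappell Roan",
    "Taylor Swift",
]

x = heard.count("Chappell Roan")  # songs heard by Chappell Roan
n = len(heard)  # total songs heard
p_hat = x / n  # sample proportion

print(x, n, p_hat)  # 2 5 0.4
```

With only five songs heard, the sample proportion can sit far from the population proportion of 0.14.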
Should we listen to 1 song?
Should we listen to 5 songs?
Should we listen to 10 songs?
Should we listen to 100 songs?
Should we listen to 200 songs?
How well the sample proportion \(\hat{p}_n\) represents the population proportion \(p\) depends on the size of the denominator, that is, the number of observations \(n\) in the sample.
As more observations are collected, the sample proportion \(\hat{p}_n\) of a particular outcome approaches the population proportion \(p\) of that outcome.
The sample proportion is an unreliable estimator of the population proportion when the sample size is small.
The sample proportion is a reliable estimator of the population proportion when the sample size is large.
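A small simulation can make this concrete. The sketch below assumes "true" shuffle (sampling with replacement) on the 50-song playlist with 7 Chappell Roan songs, and checks the sample proportion after 1, 5, 10, 100, and 200 songs, plus one much larger sample. The seed, the "Other artist" label, and the choice of 100,000 plays are assumptions made for the example.

```python
import random

rng = random.Random(1220)  # fixed seed so the sketch is repeatable

# 7 of the 50 songs are by Chappell Roan, as in the playlist example.
playlist = ["Chappell Roan"] * 7 + ["Other artist"] * 43
p = 7 / 50  # population proportion

# "True" shuffle: every song is eligible on every draw.
plays = rng.choices(playlist, k=100_000)


def sample_proportion(n):
    """Proportion of the first n songs heard that were by Chappell Roan."""
    return plays[:n].count("Chappell Roan") / n


for n in (1, 5, 10, 100, 200, 100_000):
    p_hat = sample_proportion(n)
    print(f"n = {n:>6}: p_hat = {p_hat:.3f}, distance from p = {abs(p_hat - p):.3f}")
```

After one song the sample proportion can only be 0 or 1, so it cannot be close to 0.14. As \(n\) grows, the distance from \(p\) tends to shrink, although it does not have to shrink at every single step.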
DATA1220-55 Fall 2024, Class 09 | Updated: 2024-09-18